{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "gpuClass": "standard",
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Embeddings index components\n",
        "\n",
        "The main components of txtai are `embeddings`, `pipeline`, `workflow` and an `api`. The following shows the top level view of the txtai src tree.\n",
        "\n",
        "```\n",
        "Abbreviated listing of src/txtai\n",
        " ann\n",
        " api\n",
        " database\n",
        " embeddings\n",
        " pipeline\n",
        " scoring\n",
        " vectors\n",
        " workflow\n",
        "```\n",
        "\n",
        "One might ask, why are `ann`, `database`, `scoring` and `vectors` top level packages and not under the `embeddings` package? The `embeddings` package provides the glue between these components, making everything easy to use. The reason is that each of these packages are modular and can be used on their own! \n",
        "\n",
        "This notebook will go through a series of examples demonstrating how these components can be used standalone as well as combined together to build custom search indexes.\n",
        "\n",
        "_Note: This is intended as a deep dive into txtai `embeddings` components. There are much simpler high-level APIs for standard use cases._"
      ],
      "metadata": {
        "id": "-xU9P9iSR-Cy"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ],
      "metadata": {
        "id": "shlUi2kKS7KT"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xEvX9vCpn4E0"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai datasets"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Load dataset\n",
        "\n",
        "This example will use the `ag_news` dataset, which is a collection of news article headlines."
      ],
      "metadata": {
        "id": "408IyXzKFSiG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "dataset = load_dataset(\"ag_news\", split=\"train\")"
      ],
      "metadata": {
        "id": "IQ_ns6YvFRm1"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Approximate nearest neighbor (ANN) and Vectors\n",
        "\n",
        "In this section, we'll use the `ann` and `vectors` package to build a similarity index over the `ag_news` dataset.\n",
        "\n",
        "The first step is vectorizing the text. We'll use a `sentence-transformers` model. "
      ],
      "metadata": {
        "id": "AtEdP7Utw3mk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "from txtai.vectors import VectorsFactory\n",
        "\n",
        "model = VectorsFactory.create({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\"}, None)\n",
        "\n",
        "embeddings = []\n",
        "\n",
        "# List of all text elements\n",
        "texts = dataset[\"text\"]\n",
        "\n",
        "# Create embeddings buffer, vector model has 384 features\n",
        "embeddings = np.zeros(dtype=np.float32, shape=(len(texts), 384))\n",
        "\n",
        "# Vectorize text in batches\n",
        "batch, index, batchsize = [], 0, 128\n",
        "for text in texts:\n",
        "  batch.append(text)\n",
        "\n",
        "  if len(batch) == batchsize:\n",
        "    vectors = model.encode(batch)\n",
        "    embeddings[index : index + vectors.shape[0]] = vectors\n",
        "    index += vectors.shape[0]\n",
        "    batch = []\n",
        "\n",
        "# Last batch\n",
        "if batch:\n",
        "    vectors = model.encode(batch)\n",
        "    embeddings[index : index + vectors.shape[0]] = vectors\n",
        "\n",
        "# Normalize embeddings\n",
        "embeddings /= np.linalg.norm(embeddings, axis=1)[:, np.newaxis]\n",
        "\n",
        "# Print shape\n",
        "embeddings.shape"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DPWrubv5oOn7",
        "outputId": "972c1837-2404-42f4-9a5e-51f3c3f149ee"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(120000, 384)"
            ]
          },
          "metadata": {},
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Next we'll build a vector index using these embeddings!"
      ],
      "metadata": {
        "id": "SDaDLMyXLGe1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.ann import ANNFactory\n",
        "\n",
        "# Create Faiss index using normalized embeddings\n",
        "ann = ANNFactory.create({\"backend\": \"faiss\"})\n",
        "ann.index(embeddings)\n",
        "\n",
        "# Show total\n",
        "ann.count()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ILSfWHxVHex0",
        "outputId": "b9da6a79-778f-4338-a6b5-d693772fcdae"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "120000"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's run a search."
      ],
      "metadata": {
        "id": "B_XnpIpXNKSP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "query = model.encode([\"best planets to explore for life\"])\n",
        "query /= np.linalg.norm(query)\n",
        "\n",
        "for uid, score in ann.search(query, 3)[0]:\n",
        "  print(uid, texts[uid], score)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "c2FVQlxSLKgP",
        "outputId": "825ca5de-d765-4c67-fdde-c7fe06557095"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792\n",
            "16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.5688529014587402\n",
            "45029 Coming Soon: \"Good\" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these \"good\" Jupiters may harbor habitable Earth-like planets as well. 0.5606889724731445\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "And there it is, a full vector search system without using the `embeddings` package.\n",
        "\n",
        "Just as a reminder, the following much simpler code does the same thing with an Embeddings instance."
      ],
      "metadata": {
        "id": "00dnum6fNNM0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\"})\n",
        "embeddings.index((x, text, None) for x, text in enumerate(texts))\n",
        "\n",
        "for uid, score in embeddings.search(\"best planets to explore for life\"):\n",
        "  print(uid, texts[uid], score)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OYAqPoTmNaNN",
        "outputId": "30b6305a-11da-4439-e0f5-646aef8d96f6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792\n",
            "16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.568852961063385\n",
            "45029 Coming Soon: \"Good\" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these \"good\" Jupiters may harbor habitable Earth-like planets as well. 0.560688853263855\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Database\n",
        "\n",
        "When the `content` parameter is enabled, an Embeddings instance stores both vector content and raw content in a database. But the `database` package can be used standalone too."
      ],
      "metadata": {
        "id": "KfGc7iNRO0Tw"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.database import DatabaseFactory\n",
        "\n",
        "# Load content into database\n",
        "database = DatabaseFactory.create({\"content\": True})\n",
        "database.insert((x, row, None) for x, row in enumerate(dataset))\n",
        "\n",
        "# Show total\n",
        "database.search(\"select count(*) from txtai\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9IwcIgocPSDr",
        "outputId": "eeceee2f-cf35-414f-b9e4-ad975c118e42"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'count(*)': 120000}]"
            ]
          },
          "metadata": {},
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The full txtai [SQL query syntax](https://neuml.github.io/txtai/embeddings/query/#sql) is available, including working with dynamically created fields."
      ],
      "metadata": {
        "id": "l18dapNgSqnA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "database.search(\"select count(*), label from txtai group by label\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7yPjcT3peDnZ",
        "outputId": "241c98df-9c24-47a1-8ca5-e5bace19abfb"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'count(*)': 30000, 'label': 0},\n",
              " {'count(*)': 30000, 'label': 1},\n",
              " {'count(*)': 30000, 'label': 2},\n",
              " {'count(*)': 30000, 'label': 3}]"
            ]
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's run a query to find text containing the word planets."
      ],
      "metadata": {
        "id": "S8G09Ib0kRB9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for row in database.search(\"select id, text from txtai where text like '%planets%' limit 3\"):\n",
        "  print(row[\"id\"], row[\"text\"])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Aq1ZvOhvQHRO",
        "outputId": "3c2f9c94-57b1-489d-c854-60b909187ff0"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "100 Comets, Asteroids and Planets around a Nearby Star (SPACE.com) SPACE.com - A nearby star thought to harbor comets and asteroids now appears to be home to planets, too. The presumed worlds are smaller than Jupiter and could be as tiny as Pluto, new observations suggest.\n",
            "102 Redesigning Rockets: NASA Space Propulsion Finds a New Home (SPACE.com) SPACE.com - While the exploration of the Moon and other planets in our solar system is nbsp;exciting, the first task for astronauts and robots alike is to actually nbsp;get to those destinations.\n",
            "272 Sharpest Image Ever Obtained of a Circumstellar Disk Reveals Signs of Young Planets MAUNA KEA, Hawaii -- The sharpest image ever taken of a dust disk around another star has revealed structures in the disk which are signs of unseen planets.     Dr...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Since this is just a SQL database, text search is quite limited. The query above just retrieved results with the word planets in it."
      ],
      "metadata": {
        "id": "xhND31tnkOY2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Scoring\n",
        "\n",
        "Since the original txtai release, there has been a `scoring` package. The main use case for this package is building a weighted sentence embeddings vector when using word vector models. But this package can also be used standalone to build BM25, TF-IDF and/or SIF text indexes."
      ],
      "metadata": {
        "id": "-vNVSA2FQnKj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.scoring import ScoringFactory\n",
        "\n",
        "# Build index\n",
        "scoring = ScoringFactory.create({\"method\": \"bm25\", \"terms\": True, \"content\": True})\n",
        "scoring.index((x, text, None) for x, text in enumerate(texts))\n",
        "\n",
        "# Show total\n",
        "scoring.count()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LJ2FskiiQ_l_",
        "outputId": "d023191a-b9bc-47fa-be99-375b81f02e8e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "120000"
            ]
          },
          "metadata": {},
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "for row in scoring.search(\"planets explore life earth\", 3):\n",
        "  print(row[\"id\"], row[\"text\"], row[\"score\"])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "slLtmzfbRuf6",
        "outputId": "be7ba16e-fd6d-4c30-fb43-a4617bac21ca"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "16327 3 Planets Are Found Close in Size to Earth, Making Scientists Think 'Life' A trio of newly discovered worlds are much smaller than any other planets previously discovered outside of the solar system. 17.768332448130707\n",
            "16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 17.65941968170793\n",
            "16620 New Planets could advance search for Life Astronomers in Europe and the United States have found two new planets about 20 times the size of Earth beyond the solar system. The discovery might be a giant leap forward in  17.65941968170793\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The search above ran a BM25 search across the dataset. The search will return more keyword/literal results. With proper query construction, the results can be decent.\n",
        "\n",
        "Comparing the vector search results earlier and these results are a good lesson in the differences between keyword and vector search."
      ],
      "metadata": {
        "id": "BOW5JlxYS3Rm"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Database and Scoring\n",
        "\n",
        "Earlier we showed how the `ann` and `vectors` components can be combined to build a vector search engine. Can we combine the `database` and `scoring` components to add keyword search to a database? Yes!"
      ],
      "metadata": {
        "id": "gqEBeBEoTrup"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def search(query, limit=3):\n",
        "  # Get similar clauses, if any\n",
        "  similar = database.parse(query).get(\"similar\")\n",
        "  return database.search(query, [scoring.search(args[0], limit * 10) for args in similar] if similar else None, limit)\n",
        "\n",
        "# Rebuild scoring - only need terms index\n",
        "scoring = ScoringFactory.create({\"method\": \"bm25\", \"terms\": True})\n",
        "scoring.index((x, text, None) for x, text in enumerate(texts))\n",
        "\n",
        "for row in search(\"select id, text, score from txtai where similar('planets explore life earth') and label = 0\"):\n",
        "  print(row[\"id\"], row[\"text\"], row[\"score\"])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "MEpZAr0TUMCK",
        "outputId": "b320604b-f2a0-4546-c91d-2d2b0c449382"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "15363 NASA to Announce New Class of Planets Astronomers have discovered four new planets in a week's time, an exciting end-of-summer flurry that signals a sharper era in the hunt for new worlds.    While none of these new bodies would be mistaken as Earth's twin, some appear to be noticeably smaller and more solid - more like Earth and Mars - than the gargantuan, gaseous giants identified before... 12.582923259697132\n",
            "15900 Astronomers Spot Smallest Planets Yet American astronomers say they have discovered the two smallest planets yet orbiting nearby stars, trumping a small planet discovery by European scientists five days ago and capping the latest round in a frenzied hunt for other worlds like Earth.    All three of these smaller planets belong to a new class of \"exoplanets\" - those that orbit stars other than our sun, the scientists said in a briefing Tuesday... 12.563928231067155\n",
            "15879 Astronomers see two new planets US astronomers find the smallest worlds detected circling other stars and say it is a breakthrough in the search for life in space. 12.078383982352994\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "And there it is, scoring-based similarity search with the same syntax as standard txtai vector queries, including additional filters!\n",
        "\n",
        "txtai is built on vector search, machine learning and finding results based on semantic meaning. It's been well-discussed from a functionality standpoint how vector search has many advantages over keyword search. The one advantage keyword search has is speed. "
      ],
      "metadata": {
        "id": "aPaZsoxnYW8I"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook walked through each of the packages used by an Embeddings index. The Embeddings index makes this all transparent and easy to use. But each of the components do stand on their own and can be individually integrated into a project!"
      ],
      "metadata": {
        "id": "4L8smyyXc8q8"
      }
    }
  ]
}